NSF PAR Search | NSF Public Access Repository


Implementation-Oblivious Transparent Checkpoint-Restart for MPI

https://doi.org/10.1145/3624062.3624255

Xu, Yao; Belyaev, Leonid; Jain, Twinkle; Schafer, Derek; Skjellum, Anthony; Cooperman, Gene (November 2023, ACM)

Full Text Available
MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale

https://doi.org/10.1109/SCWS55283.2021.00019

Xu, Yao; Zhao, Zhengji; Garg, Rohan; Khetawat, Harsh; Hartman-Baker, Rebecca; Cooperman, Gene (November 2021, 2021 SC Workshops Supplementary Proceedings (SCWS))

MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency (“network-agnostic”) feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
Full Text Available
CRAC: checkpoint-restart architecture for CUDA with streams and UVM

Jain, Twinkle; Cooperman, Gene (November 2020, International Conference for High Performance Computing Networking Storage and Analysis)
Full Text Available
Docker Container Deployment in Distributed Fog Infrastructures with Checkpoint/Restart

https://doi.org/10.1109/MobileCloud48802.2020.00016

Ahmed, Arif; Mohan, Apoorve; Cooperman, Gene; Pierre, Guillaume (August 2020, IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud (MobileCloud'20))
Full Text Available
Sthread: In-Vivo Model Checking of Multithreaded Programs

https://doi.org/10.22152/programming-journal.org/2020/4/13

Cooperman, Gene; Quinson, Martin (February 2020, The Art, Science, and Engineering of Programming)
Full Text Available
Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC

Chouhan, Prashant Singh; Khetawat, Harsh; Resnik, Neil; Jain, Twinkle; Garg, Rohan; Cooperman, Gene; Hartman-Baker, Rebecca; Zhao, Zhengji (February 2021, First International Symposium on Checkpointing for Supercomputing (SuperCheck21))

Checkpoint/restart (C/R) provides fault-tolerant computing capability, enables long-running applications, and provides scheduling flexibility for computing centers to support diverse workloads with different priorities. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et al., is a transparent checkpointing tool that has been selected due to its MPI-agnostic and network-agnostic approach. However, originally written as a proof-of-concept code, MANA was not ready to use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs that were exposed by the top applications at NERSC, adding new features to address system changes, evaluating C/R overhead at scale, etc. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers and HPC end-users alike.
Full Text Available
MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

https://doi.org/10.1145/3307681.3325962

Garg, Rohan; Price, Gregory; Cooperman, Gene (June 2019, Proc. of 28th Int. Symp. on High Performance Parallel and Distributed Computing (HPDC'19))

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.
Full Text Available
Job migration in HPC clusters by means of checkpoint/restart

https://doi.org/10.1007/s11227-019-02857-y

Rodríguez-Pascual, Manuel; Cao, Jiajun; Moríñigo, José A.; Cooperman, Gene; Mayo-García, Rafael (October 2019, The Journal of Supercomputing)

Full Text Available
Towards Non-Intrusive Software Introspection and Beyond

https://doi.org/10.1109/IC2E48712.2020.00025

Mohan, Apoorve; Nadgowda, Shripad; Pipaliya, Bhautik; Varma, Sona; Suneja, Sahil; Isci, Canturk; Cooperman, Gene; Desnoyers, Peter; Krieger, Orran; Turk, Ata (April 2020, IEEE International Conference on Cloud Engineering (IC2E))
Full Text Available
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

https://doi.org/10.1109/CLUSTER.2018.00047

Garg, Rohan; Mohan, Apoorve; Sullivan, Michael; Cooperman, Gene (September 2018, Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'18))

Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM has become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. The support for UVM is particularly attractive for programs requiring more memory than resides on the GPU, since the alternative to UVM is for the application to directly copy memory between device and host. Furthermore, CRUM supports fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.
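The forked checkpointing mentioned in this abstract can be illustrated with a minimal, CPU-only sketch. It is not CRUM's implementation, which also has to handle CUDA and UVM state; it only shows the general idea of forking a child that writes a copy-on-write snapshot while the parent keeps computing. The names `forked_checkpoint` and `run_demo`, the pickle format, and the file name are all hypothetical, and the sketch assumes a Unix system where `os.fork` is available.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's PID.

    The child holds a copy-on-write snapshot of the parent's memory as of
    the fork, so the parent may keep mutating `state` afterwards.
    (Hypothetical helper, not part of CRUM or DMTCP.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: persist the snapshot, then exit without parent cleanup.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename: no partial image is visible
            status = 0
        finally:
            os._exit(status)
    return pid


def run_demo():
    state = {"step": 0, "values": [0.0] * 1000}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    pid = None
    for step in range(1, 6):
        state["step"] = step
        state["values"][0] = float(step)
        if step == 3:
            # Checkpoint at step 3; the loop continues while the child writes.
            pid = forked_checkpoint(state, path)
    _, wait_status = os.waitpid(pid, 0)
    with open(path, "rb") as f:
        saved = pickle.load(f)
    return state["step"], saved["step"], os.WEXITSTATUS(wait_status)
```

Calling `run_demo()` lets the parent finish all five steps while the saved image still reflects step 3, which is the overlap of computation and checkpoint storage that the abstract describes.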
Full Text Available
